Show the code
import pandas as pd
import numpy as np
from lets_plot import *
# add the additional libraries you need to import for ML here
LetsPlot.setup_html(isolated_frame=True)import pandas as pd
import numpy as np
from lets_plot import *
# add the additional libraries you need to import for ML here
LetsPlot.setup_html(isolated_frame=True)# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html
# Include and execute your code here
# import your data here using pandas and the URL
import pandas as pd
ml_url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv"
neigh_url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv"
info_url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_denver/dwellings_denver.csv"
ml = pd.read_csv(ml_url)
neigh = pd.read_csv(neigh_url)
info = pd.read_csv(info_url)
homes = ml.merge(neigh, on="parcel", how="left")
homes.head()| parcel | abstrprd | livearea | finbsmnt | basement | yrbuilt | totunits | stories | nocars | numbdrm | ... | nbhd_802 | nbhd_803 | nbhd_804 | nbhd_805 | nbhd_901 | nbhd_902 | nbhd_903 | nbhd_904 | nbhd_905 | nbhd_906 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00102-08-065-065 | 1130 | 1346 | 0 | 0 | 2004 | 1 | 2 | 2 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 00102-08-073-073 | 1130 | 1249 | 0 | 0 | 2005 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 00102-08-078-078 | 1130 | 1346 | 0 | 0 | 2005 | 1 | 2 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 00102-08-081-081 | 1130 | 1146 | 0 | 0 | 2005 | 1 | 1 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 00102-08-086-086 | 1130 | 1249 | 0 | 0 | 2005 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 324 columns
A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)
A Client has requested this analysis and this is your one shot of what you would say to your boss in a 2 min elevator ride before he takes your report and hands it to the client.
Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
These charts were all showing relationships between different things that were all revolving around the numbers 0-1 mainly. Some wasy to improve the algorithms would be to do totals instead of just ratios or percentages. Also, some of the categorical variables could be grouped better to reduce the number of categories and make it easier for the model to learn patterns.
# Include and execute your code here
# Initialize Lets-Plot
LetsPlot.setup_html(isolated_frame=True)
# Select a few features to visualize
numeric_features = ['livearea', 'yrbuilt', 'stories']
categorical_features = ['gartype_Att', 'condition_Good']
# 1. Scatter
p1 = ggplot(homes, aes(x='yrbuilt', y='before1980')) + \
geom_jitter(width=0, height=0.02, alpha=0.3) + \
geom_smooth(method="loess", se=True, color='red') + \
ggtitle("Year Built vs Before1980 (LOESS Trend)") + \
xlab("Year Built") + ylab("Before1980")
# 2. Boxplot
p2 = ggplot(homes, aes(x='before1980', y='livearea')) + \
geom_boxplot() + \
ggtitle("Living Area vs Before1980") + \
xlab("Before1980") + ylab("Living Area (sq ft)")
# 3. Bar Chart
homes['before1980_str'] = homes['before1980'].astype(str)
p3 = ggplot(homes, aes(x='gartype_Att', fill='before1980_str')) + \
geom_bar(position="dodge") + \
ggtitle("Garage Type (Attached) vs Before1980") + \
xlab("Attached Garage") + ylab("Count") + \
scale_fill_manual(values=["#1f77b4","#ff7f0e"], name="Before1980")
# show plots
p1.show()
p2.show()
p3.show()Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
I built a Random Forest classifier to predict whether a house was built before 1980. This algorithm was chosen because it handles both numeric and categorical features well and is easily interpretted. The model achieved 100 percent accuracy because the year the house was built was 100% predictive of telling if it was built before 1980 or not. It was more of just a pass/fail test since the target variable is directly derived from the year built.
# Include and execute your code here
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Prepare features and target
X = homes.drop(columns=['parcel', 'before1980']) # drop non-predictive IDs and target
y = homes['before1980']
# Convert all boolean/categorical columns to numeric
X = pd.get_dummies(X, drop_first=True)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Scale numeric features
scaler = StandardScaler()
X_train[X_train.select_dtypes(include=['float64','int64']).columns] = scaler.fit_transform(
X_train.select_dtypes(include=['float64','int64'])
)
X_test[X_test.select_dtypes(include=['float64','int64']).columns] = scaler.transform(
X_test.select_dtypes(include=['float64','int64'])
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Random Forest with basic tuning
rf = RandomForestClassifier(
n_estimators=200,
max_depth=15,
min_samples_split=5,
min_samples_leaf=2,
random_state=42
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred))
X_simple = homes[['yrbuilt']]
y = homes['before1980']
X_train, X_test, y_train, y_test = train_test_split(X_simple, y, test_size=0.2, random_state=42, stratify=y)
from sklearn.ensemble import RandomForestClassifier
rf_simple = RandomForestClassifier(n_estimators=100, random_state=42)
rf_simple.fit(X_train, y_train)
print("Accuracy using only yrbuilt:", rf_simple.score(X_test, y_test))Accuracy: 1.000
precision recall f1-score support
0 1.00 1.00 1.00 2131
1 1.00 1.00 1.00 3462
accuracy 1.00 5593
macro avg 1.00 1.00 1.00 5593
weighted avg 1.00 1.00 1.00 5593
Accuracy using only yrbuilt: 1.0
Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.
This analysis shows that the most important feature for predicting whether a house was built before 1980 is the year it was built. This makes sense since the target variable is directly derived from this feature. Other features such as living area and garage type have much lower importance, indicating they contribute less to the model’s predictions.
# Include and execute your code here
# =============================
# Feature importance plot (lets_plot) — fixed version
# =============================
from lets_plot import *
from sklearn.ensemble import RandomForestClassifier
# Ensure Lets-Plot is initialized
LetsPlot.setup_html()
# --- Get feature importances (handle missing rf variable) ---
try:
importances = rf.feature_importances_
feature_names = X.columns
except Exception as e:
# If rf doesn't exist, train a quick RF on the existing X/y (safe fallback)
print("Warning: 'rf' not found or not usable. Training a fallback RandomForest for importances.")
rf_tmp = RandomForestClassifier(n_estimators=200, random_state=42)
# if you only have X and y already split, they should exist; otherwise try ml -> build X,y
try:
rf_tmp.fit(X, y)
except Exception as e2:
raise RuntimeError("Couldn't train fallback RF. Make sure X and y are defined (features and target).") from e2
importances = rf_tmp.feature_importances_
feature_names = X.columns
# Build DataFrame of importances
fi = pd.DataFrame({
"feature": feature_names,
"importance": importances
})
# Sort descending and keep top 20
fi = fi.sort_values("importance", ascending=False).head(20).reset_index(drop=True)
# Make the feature column a categorical with the exact display order (lets_plot respects category order)
fi['feature'] = pd.Categorical(fi['feature'], categories=fi['feature'].tolist(), ordered=True)
# Plot with lets_plot (use geom_bar with stat='identity')
plot = (
ggplot(fi, aes(x='feature', y='importance')) +
geom_bar(stat='identity', fill="#4C72B0") +
coord_flip() +
labs(
title="Top 20 Feature Importances (Random Forest)",
x="Feature",
y="Importance"
) +
theme_minimal()
)
plotDescribe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
type your results and analysis here
# Include and execute your code here
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score
# Split data (keep all columns consistent)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale numeric features
num_cols = X_train.select_dtypes(include=['float64','int64']).columns
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
# Train Random Forest
rf = RandomForestClassifier(n_estimators=200, max_depth=15, min_samples_split=5, min_samples_leaf=2, random_state=42)
rf.fit(X_train, y_train)
# Predict
y_pred = rf.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
print("=== Model Quality Metrics ===")
print(f"Accuracy : {accuracy:.3f} (proportion of correct predictions)")
print(f"Precision: {precision:.3f} (correctness among predicted positives)")=== Model Quality Metrics ===
Accuracy : 1.000 (proportion of correct predictions)
Precision: 1.000 (correctness among predicted positives)
Repeat the classification model using 3 different algorithms. Display their Feature Importance, and Decision Matrix. Explian the differences between the models and which one you would recommend to the Client.
type your results and analysis here
# Include and execute your code hereJoin the dwellings_neighborhoods_ml.csv data to the dwelling_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and if this changes the model you recomend to the Client.
type your results and analysis here
# Include and execute your code hereCan you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.
type your results and analysis here
# Include and execute your code here